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Abstract 

To date, testing interactions in high dimensions has been a challenging task. 
Existing methods often have issues with sensitivity to modeling assumptions 
and heavily asymptotic nominal p- values. To help alleviate these issues, we 
propose a permutation-based method for testing marginal interactions with a 
binary response. Our method searches for pairwise correlations which differ 
between classes. In this manuscript, we compare our method on real and simu- 
lated data to the standard approach of running many pairwise logistic models. 
On simulated data our method finds more significant interactions at a lower 
false discovery rate (especially in the presence of main effects). On real genomic 
data, although there is no gold standard, our method finds apparent signal and 
tells a believable story, while logistic regression does not. We also give asymp- 
totic consistency results under not too restrictive assumptions. 
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1 Introduction 



In many areas of modern science, massive amounts of data are generated. In the 
biomedical sciences, examples arise in genomics, proteomics, and flow cytometry. 
New high-throughput experiments allow researchers to look at the dynamics of very 
rich systems. With these vast increases in data accumulation, scientists have found 
classical statistical techniques in need of improvement, and classical notions of error 
control (type 1 error) overwhelmed. 

Consider the following two class situation: our data consists of n observations, 
each observation with a known class label of 1 or 2, with p covariates measured per 
observation. Let y denote the n-vector corresponding to class (with ri\ observations 
in class 1 and n 2 in class 2), and X, the n x p matrix of covariates. We often assume 
each row of X is independently normally distributed with some class specific mean 
H y (i\ G W and covariance £ym (for instance in quadratic discriminant analysis). Here, 
we are interested in differences between classes. A common example is gene expression 
data on healthy and diseased patients: the covariates are the genes (p ~ 20, 000), the 
observations are patients (n ~ 100) belonging to either the healthy or diseased class. 
Here, one might look at differences between classes to develop a genetic prognostic test 
of the disease, or to better understand its underlying biology. Recent high dimensional 
procedures have focused on detecting differences between \i\ and /i 2 by considering 
them one covariate at a time. 

In this paper we consider the more difficult problem of testing marginal interac- 
tions. In a fashion similar to the approaches used in large scale testing of main effects 
(see e.g Dudoit et al. 2003] , Tusher et al. 2001| and Efron 2010] ) , we do this on a 



pair by pair basis. 

The standard approach for this problem has been to run many bivariate logistic 
regressions and then conduct a post-hoc analysis on the nominal p-values. Buzkova 



et al. 2011 has a nice summary of the subtle issues that arise in testing for just a 
single interaction in a regression framework. In particular, a permutation approach 
cannot be simply applied because it tests the null hypothesis of both no interaction 
and no main effects at the same time. In the high-dimensional setting with FDR 
estimates, these issues are compounded. 

The logistic regression based methods are all derived from what we call a forward 
model, that is, a model for the conditional distribution of Y\X. In contrast, a back- 
ward model (discussed below) is a model for the conditional distribution of X\Y. We 
propose a method, based on a backwards model, to approach this same problem. By 
using this backwards framework we avoid many of the pitfalls of standard approaches: 
we have a less model-based method, we attack a potentially more scientifically inter- 
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esting quantity, and we can use a permutation null for FDR estimates. Our approach 
is unfortunately only for binary response — the backwards model is more difficult to 
work with for continuous y. 

In this paper we develop our method, and show its efficacy as compared to straight- 
forward logistic regression on real and simulated data. We explain how to deal with 
nuisance variables, and give insight into our permutation-based estimates of FDR. 
We also give some asymptotic consistency results. 



2 Existing Methods 

We begin by going more in-depth on the standard approach and its issues. In general 
one might like to specify a generative logistic model for the data (a forward model) 
of the form 

v 

logit [P( yi = l\X it .)] =Po + J2 fr X tJ + Yl Ti,fc^i A* (!) 

3=1 k<j 

where X^. is the z-th row of X, and test if the 7^ are nonzero in this model. Here i 
indexes the observations and j, k index the predictors. However, because it is a joint 
rather than a marginal model, this does not easily allow us to test individual pairs of 
covariates separately from the others. Furthermore in the scenario with n < p(p+l)/2, 
the MLE for this model is not well defined (one can always get perfect separation) 
and non-MLE estimates are very difficult to use for testing. 

Alternatively, for each pair (X it j,X it k) one might assume a generative logistic 
model of the form 

logit [P( yi = l\X id , X i>k )} =/3 Q + PjX hj + (3 k X itk + j jjk X i}j X ijk (2) 

and estimate or test 7^ using the MLE 7^. 

A standard approach to this problem in the past has been to fit pairwise logistic 
models ^ independently for every pair (j, k), and then use standard tools (ie. asymp- 
totic normality of the MLE) to calculate approximate P- values. Once the p(p — l)/2 



p-values are calculated, the approach of Benjamini and Hochberg 1995 or some other 



standard procedure can be used to estimate/control FDR. 

This approach has a number of problems. First of all, while the approach is very 
model-based, one cannot even ensure that all of the bivariate logistic models are 
consistent with one another (i.e. that there is a multivariate model with the given 
marginals). In particular, model misspecification will often cause over-dispersion 
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resulting in anti-conservative FDR estimates. Also, if the true model contained 
quadratic terms (which we do not have in our model) then for correlated pairs of 
features this approach will compensate by trying to add false interactions. Even if 
we did believe the model, the p-values are only approximate, and this approximation 
grows worse as we move into the tails. 

One might hope to avoid some of these issues by using permutation p-values, 



however, as shown in Buzkova et al. 2011 permutation methods are incongruous with 



this approach — they test the joint null hypothesis of no main effect or interaction, 



which is not the hypothesis of interest. This difficulty is also discussed in Pesarin 



20011. In an attempt to resolve this, Kooperberg and LeBlanc 2008 regress out the 



main effects before permuting the residuals. This is a nice adjustment, but is still 
heavily model-based. 

To deal with these issues, we take a step back and use a different generative model. 
Our generative model has an equivalent logistic model and this correspondence allows 
us to sidestep many of the issues with the standard logistic approach. 

2.1 Forward vs Backward Model 

We propose to begin with a "backward" generative model — as mentioned in Section [TJ 
we assume that observations are Gaussian in each class (xi\yi) ~ N(n y ^, S^)) with 
a class specific mean and covariance matrix. We argue that the most natural test of 
interaction is a test of equality of correlations between groups. 

Toward this end, let us apply Bayes theorem to our backwards generative model, 
to obtain 



P(y = l\x) 



7Ti exp (li 



7r 2 exp (7 2 ) + 7Ti exp(Zi) 

exp [log(7Ti/7r 2 ) + Zi - / 2 ] 
1 + exp [log(7Ti/7r 2 ) + h- k] 



where 



l m = -pfog (27r) /2 - logdet (E m ) /2 - (x - ^^^{x - /x m )/2 

and TT m is the overall prevalence of class m. We can simplify this to 

logit (P) = logdet (E 2 ) /2 - logdet (Ei) /2 + log(7r 1 / 7 r 2 ) + 

- //^E^Vi/ 2 + (Ej- Vi - S 2 V2) T x + x T (E 2 1 - E^ 1 ) x/2. 
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This is just a logistic model with interactions and quadratic terms, and in the form 
of Q (with additional quadratic terms) we have 



From here we can see that traditional logistic regression interactions in the full model 
correspond to nonzero off-diagonal elements of £3 1 ~~ ^i 1 - Testing for non-zero ele- 
ments here is not particularly satisfying for a number of reasons. Because coordinate 
estimates are so intertwined, there is no simple way to marginally test for non-zero 
elements in £3 1 — S^ 1 — in particular there is no straightforward permutation test. 
Also, for n < p the MLEs for the precision matrices are not well defined. 

As in the logistic model ^ we may condition on only a pair of covariates j and 
k in our backwards model. Using Bayes theorem as above, our equivalent bivariate 
forward model is 



where jl m and S m are the mean vector and covariance matrix in class m for only Xj 
and X%. Hence the backwards model has an equivalent logistic model similar to ^ 
but with quadratic terms included as well. One should note that the main effect and 
interaction coefficients in this marginal model do not match those from the full model 
(i.e. the marginal interactions and conditional interactions are different). 

Our usual marginal logistic interaction between covariates j and k corresponds to 
a nonzero off-diagonal entry in £2 1 — ^T 1 - Simple algebra gives 



where R m (j,k) is the correlation between features j and k in class m, and cr m u\ is the 
standard deviation of variable j in class m. 

Thus, if we were to test for "logistic interactions" in our pairwise backwards model, 
we would be testing: 



A, = logdet (E 2 ) /2 - logdet (E x ) /2 + log^/v^) 



+ /4£-V2/2-WErVi/2 

pj = (srVi - s 2 

7j,k= (Ea 1 -^ 1 ),*. 



P(y = l\x = (xj,x k ) T ) = log(7ri/7r 2 ) + /ij^ V2/2 - 
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Now, if <J\(j) = 0"2(j), and c?i(k) = 0"2(fc) ; then this is equivalent to testing if R\(j,k) — 
i?2(j,fc)- If not, then a number of unsatisfying things may happen. For example if the 
variance of a single variable changes between classes, then, even if its correlation with 
other variables remains the same, it still has an "interaction" with all variables with 
which it is correlated. This change of variance is a characteristic of a single variable, 
and it seems scientifically misleading to call this as an "interaction" between a pair of 
features. 

Toward this end, we consider a restricted set of null hypotheses — rather than 
testing for an interaction between each pair of features (j, k), we test the null Ritj,k) = 
R-2(j,k)- Not all logistic interactions will have Ritj,k) R-2(j,k), but we believe this is 
the property which makes an interaction physically/scientifically interesting. 

To summarize, there are a number of issues in the forward model which are alle- 
viated through the use of the backwards model: 

• The marginal forward models are not necessarily consistent (one cannot always 
find a "full forward model" with the given marginals). 

• Omitted quadratic terms may be mistaken for interactions between correlated 
covariates. 

• Interesting interactions are only those for which R\(j t k) 7^ R2(j,k)- 

• P-values are approximate and based on parametric assumptions. 



3 Proposal 

We begin with the generative model described in Section [2T] — we assume observations 
are Gaussian in each class (xi\yi) ~ N[fi y r^,'E y u-\) with a class specific mean and 
covariance matrix. As argued above, we test for interactions by testing 

Hj,k '■ Rl{j,k) = R2(j,k) 

for each j < k, where again, R m (j.k) denotes the (j, k)-th entry of the correlation 
matrix for class m. 

If we were only testing one pair of covariates (j,k), a straightforward approach 
would be to compare the sample correlation coefficients Ri(j,k) to R2(j,k)- in general, 
because the variance of R m (j,k) is dependent on R m (j,k), it is better to make inference 
on a Fisher transformed version of R m (j,k)'- 

U m (j,k) = arctanh (# m (j,fc)) ~ N ^arctanh (R m (j,k)) , - 1 _ ^ ) 
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This is a variance stabilizing transformation. Now, to compare the two correlations 
we consider the statistic 



%*) = - U 2 (j, k )~N ( arctanh (Ri(j,k)) ~ arctanh (-R 2 (j,fc)) , — ^— r H ^— r J 

\ fii - o fi2 - oy 

(3) 

Under the null hypothesis: Ri(j,k) — R-2{j,k), this statistic is distributed iV ^0, + 
To test if the correlations are equal we need only compare our statistic T^,k) to its 
null distribution and find a p-value. While this approach works well for single tests, 
because we are in the high dimensional setting we use a different approach which 
doesn't rely on the statistic's asymptotic normal distribution. 

We are interested in testing differences between two large correlation matrices 
in higher dimensional spaces. We again calculate the differences of our transformed 
sample correlations — we now calculate p(p — l)/2 statistics; one for each pair (j, k) 
with j < k. However to assess significance we no longer just compare each statistic 
to the theoretical null distribution and find a p-value. Instead we directly estimate 
false discovery rates (FDR): we choose some threshold for our statistics, t, and reject 
(/call significant) all (j,k) with (T^)! > t. Clearly, not all marginal interactions 
called significant in this way will be truly non-null and it is important to estimate 
the FDR of the procedure for this cutoff, that is 



FDR = E 



# false rejections 
J£ total rejections 



where '^' is short-hand for "number of". It is standard to approximate this quantity 
by 

E[# false rejections] 
# total rejections 

The denominator is just the number of \Trj,k) I > t (which we know). If we knew which 
hypotheses were null and their distributions then one could find the numerator by 

E[# false rejections] = P(|T (i>fc) | > t) (5) 

(j,k) null 

Clearly we don't know which hypotheses are null. To estimate ^ we propose the 
following permutation approach. 

We first center and scale our variables within class: for each observation we sub- 
tract off the class mean for each feature and divide by that feature's within-class 
standard deviation — let X denote this standardized matrix. This standardization 
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doesn't change our original statistics, T j:k (the correlation calculated from X and X 
are identical), but is important for our null distribution. Now, let n be some ran- 
dom permutation of {l,...,n}. Thus, n(y) is a random permutation of the class 
memberships of the standardized variables (we keep the standardization from before 
the permutation). With these new class labels we calculate a new set of p(p — l)/2 
statistics, {^Q a fc )}j<fc- We can permute our data A times, and gather a large collection 
of these null statistics, (Ap(p — l)/2) of them. To estimate E[# false rejections], we 
take the average number of these statistics that lie above our cutoff 

1 A 

£[# false rejections] = 4{\T^ k) \ > t} 

a=l 

Often, one is interested in the FDR of the I most significant interactions. In this case 
the cutoff, t, is chosen to be the absolute value of the Z-th most significant statistic, 
denoted T(l). We refer to this procedure as Testing Marginal Interactions through 
correlation (TMIcor) and summarize it below. 

TMIcor: Algorithm for Testing Marginal Interactions 

1. Mean center and scale X within each group. 

2. Calculate the feature correlation matrices Ri and R 2 within each class. 

3. Fisher transform the entries (for j < k): U rn (j^) = arctanh (jl m (j,k)J 
and take their coordinate-wise differences: = Ukj^) — U 2 (j t k) 

4. for a = 1, . . . , A execute the following 

(a) Randomly permute class labels of the standardized variables. 

(b) Using the new class labels, reapply steps 2-4 to calculate new statistics 

f <~p*a \ 
l 1 (j,k)Sj<k 

5. Estimate FDR for any / most significant interactions by 

a)Eti#{l^)l>r(0> 



Using this approach, one gets a ranking of pairs of features and an FDR estimate 
for every position in the ranking. Furthermore, rather than testing for interactions 
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between all pairs of variables, one may instead test for interactions between variables 
in one set (such as genes) and variables in another (such as environmental variables). 
To do this, one would only need restrict the statistics considered in steps 3, 46 and 5. 

Standardizing in step (1) before permuting may seem strange, but in this case is 
necessary. If we do not standardize first, we are testing the joint null that the means, 
variances and correlations are the same between classes. This is precisely what we 
moved to the backward model to avoid — by standardizing we avoid permuting the 
"main effects". We discuss this permutation-based estimate of FDR in more depth in 
appendix A. 



4 Comparisons 

In this section we apply TMIcor and the standard logistic approach to real and 
simulated data. On simulated data we see that in some scenarios (in particular with 
main effects) the usual approach has serious power issues as compared to TMIcor. 
Similarly on our real dataset we see that the usual approach does a poor job of finding 
interesting interactions, while TMIcor does well. 



4.1 Simulated Data 



We attempt to simulate a simplified version of biological data. In general, groups of 
proteins or genes act in concert based on biological processes. We model this with a 
block diagonal correlation matrix — each block of proteins/genes is equi-correlated. 
This can be interpreted as a latent factor model — all the proteins in a single block are 
highly correlated with the same latent variable (maybe some unmeasured cytokine), 
and conditional on this latent variable, the proteins are all uncorrelated. In our 
simulations we use 10 blocks, each with 10 proteins (100 total proteins). We simulate 
the proteins for our healthy controls as jointly Gaussian with mean and covariance 
matrix 

••• \ 
R 2 ■■■ 



\0 ••• R 10 J 

where each Ri is a 10 x 10 matrix with Is along the diagonal, and a fixed Pi > for 
all off-diagonal entries. Now, for our diseased patients we again use mean proteins, 
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but change our covariance matrix to 



(Rx ••• \ 
R 2 ■■■ 

\0 ■■■ R W J 

where R± has Is on the diagonal and p\ for all off-diagonal entries (with < p\ ^ pi). 
This correlation structure would be indicative of a mutation in the cytokine for the 
first group causing a change in the association between that signaling protein and the 
rest of the group. 

Within each class (diseased and healthy) we simulated 250 patients and applied 
TMIcor and the usual logistic approach. We averaged the true and estimated false 
discovery rates of these methods over 10 trials. As we can see from Figure [T] TMIcor 
outperforms the logistic approach. This difference is particularly pronounced in the 
second plot of Figure [T] In this plot, because the correlations are large but different 
in both groups (pi = 0.3, p\ = 0.6), there are some moderate quadratic effects in the 
true model — this induces a bias in the logistic approach and its FDR suffers. In 
contrast, these quadratic effects are not problematic in the backward framework. 

We also consider a second set of simulations. This set used p^ = 0.3 for all i and 
Pi = 0. However, instead of mean in both classes, we set the mean for all proteins in 
block 1 for diseased patients to be some pi (> 0). The results are plotted in Figure [2] 
This mean shift had no effect on TMIcor (the procedure is meanshift invariant), 
but as the mean difference grows, it becomes increasingly difficult for the logistic 
regression to find any interactions. This issue is especially important as, biologically, 
one might expect that genes with main effects to be more likely to have true marginal 
interactions (and these interactions may also be more scientifically interesting). 

While these simulations are not exhaustive, they give an indication of a num- 
ber of scenarios in which TMIcor significantly outperforms logistic regression. More 
exhaustive simulations were run and the results mirrored those in this section. 



4.2 Real Data 

We also applied both TMIcor and logistic regression to the colitis gene expression 
data of |Burczynski et"aL 2006|. In this dataset, there are 127 total patients, 85 with 
colitis (59 Crohn's patients + 26 ulcerative colitis patients) and 42 healthy controls. 
We restricted our analysis to the 101 patients without ulcerative colitis. Each patient 
had expression data for 22283 genes run on an Affymetrix U133A microarray. Because 
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Figure 1: Plots of estimated and true FDR for TMIcor and logistic regression averaged 
over 10 trials. Error bars contain the mean value ± 1 se of the mean. For controls, 
Pi = 0.3 for all i. On the left p\ = 0, while on the right p\ = 0.6. There is no main 
effect in either panel. 



chromosomes 5 and 10 have been indicated in Crohn's disease, we enriched our dataset 
by using only the genes on these chromosomes, along with the NOD2 and ATG16L1 



genes (chromosomes as specified by the CI geneset from Subramanian et al. 2005] 



In total 663 genes were used. Some of these genes were measured by multiple probesets 
- the final expression values used for those genes were the average of all probesets. 
From these 663 genes we have 219, 453 of interactions to consider. Figure [3] shows 
the estimated FDR curves for the two methods. TMIcor finds many more significant 
interactions — at an FDR cutoff of 0.1, TMIcor finds 2570 significant interactions, 
while the logistic approach finds 15. The significant 15 from the logistic approach 
may not even be entirely believeable — the smallest p-value of the 15 is roughly 
1/219453, which is what we would expect it to be if all null hypotheses were true. 
Because the smallest p-value is large, we see that the FDR for logistic regression 
begins surprisingly high. The FDR subsequently drops because there are a number 
of p-values near the smallest, however, the significance of these hypotheses is still 
suspect. 

Unfortunately interpreting 2570 marginal interactions is difficult (even if all are 
true). Toward this end we consider the graphical representation of our analysis in 
Figure |4j Each gene is a node in our graph, and edges between genes signify marginal 
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Figure 2: Plots of estimated and true FDR for TMIcor and logistic regression averaged 
over 10 trials. Error bars contain the mean value ± 1 se of the mean. For both plots 
pi = and pi = 0.3 for all i. Both panels have main effects — on the left jli — \L\ = 0.5, 
while on the right jli — /ii = 1. 



interactions. In this plot we considered only the 1250 of the 2570 significant marginal 
interactions indicative of a decrease in correlation from healthy control to Crohn's 
(ie. Tj t k > 0). There is one large connected component, a few connected pairs and 
a large number of isolated genes. The connected component appears to be split into 
2 clusters. To get a better handle on this, we considered a more stringent cutoff for 
significant interactions — at an FDR cutoff of 0.03, we are left with 832 significant 
interactions of which only 402 have Tj^ > 0. We plot this graph in Figure [5j we see 
that our large connected component has divided into 2. From here we further zoomed 
in on each component (now displaying only the 50 most significant interactions per 
component), and can actually see which genes are are most important (in figure [6J. 

It appears, from this analysis, that there are two genetic pathways which are 
modified in Crohn's disease. Many of the genes in each cluster are already known 
to be indicated in Crohn's, but to our knowledge these interactions have not been 
considered. 
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Figure 3: Corhn's data; FDR estimates for TMIcor and logistic approaches for the 
5000 most significant marginal interactions 




Figure 4: Graph of 1250 marginal interactions (with decreasing correlation) significant 
at FDR cutoff of 0.1. Genes with no significant interactions not shown 
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Figure 5: Graph of 402 marginal interactions (with decreasing correlation) significant 
at FDR cutoff of 0.03. Genes with no significant interactions not shown 




Figure 6: Graphs of the top 50 marginal interactions in each cluster (and correspond- 
ing genes) 
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5 Dealing with Nuisance Variables 



Often, aside from the variables of interest, one may believe that other nuisance vari- 
ables play a role in complex interactions. For example, it seems reasonable that many 
genes are conditionally independent given age, but are each highly correlated with 
age. Ignoring age, these genes would appear to be highly correlated, but this cor- 
relation is uninteresting to us. TMIcor can be adapted to deal with these nuisance 
variables provided there are few compared to the number of observations, they are 
continuous, and they are observed. 

We resolve this issue by using partial correlations. Assume Xj and Xk are our vari- 
ables of interest, and z is a vector of potential confounders. Rather than comparing 
cor (xj, Xk) in groups 1 and 2, we compare the partial correlations, cor , [ar^z]). 

This is done by first regressing our potential confounders, Z, out of all the other 
features, then running the remainder of the analysis as usual. 

To adapt the original algorithm in Section [3] to deal with nuisance variables we 
need only replace step (1) by: 

1. Replace our feature matrices X\ and X2 by 

X m = [/ — Z m {Z^Z^ Zj^] X m 

Now, mean center and scale X within each group. 

We give more details motivating this approach and discussing potential computa- 
tional advantages in appendix B. 

6 Asymptotics 

In this section we give two asymptotic results. We show that if n — > 00, and log ^ n — > 0, 
then under certain regularity conditions our procedure for testing marginal interac- 
tions (in the absence of nuisance variables) is asymptotically consistent — with prob- 
ability approaching 1 it calls significant all true marginal interactions and makes no 
false rejections. Furthermore, using the permutation null, it also consistently esti- 
mates that the true FDR is converging to 0. Because we only need lo ^ Pn — > 0, p n may 
increase very rapidly in n. 

We first give a result showing that for sub-Gaussian variables our null statistics 
converge to and our alternative statistics are asymptotically bounded away from 0. 
The proof of this theorem is based on several technical lemmas which we relegate to 
appendix C. 
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Theorem 6.1 Let and x 2 (j), j = 1, . . . be random variables. Assume there is 
some C > such that for all t > 

P {\x m(j) - E[x m[j) } \ >t)< exp (1 - t 2 /C 2 ) 

for each m = 1,2 . Let denote the mean of x m (j) and o ,2 n r-s its variance. For 
each i < oo, let i^.j be independent realizations with the same distribution as x m (.). 

Let p n be a sequence of integers such that lo ^ Pn — > 0. Let R m be the correlation 
"matrix" (an infinite but countably indexed matrix) of the covariates from group m. 
Let I denote the set of ordered pairs (j, k) for which Ri(j,k) R-2(j,k), an d C n denote 
the set of ordered pairs (j, k) with j, k < p n . 

Assume for every m and j, cr^U) — a min (f or some °~min > 0)- Furthermore, 
assume that for all (j, k) in each I, \Ri(j,k) — R2(j,k)\ > ^min f or some A min > and 
that form = 1,2, sup i<fc \R m (j,k) \ < Pmax for some fixed p max < I. 

Now, given any e p > 0, and < t < A min , if we choose n sufficiently large, then 
with probability at least 1 — e p 

\ T (j,k)\ < t 

for all (j, k) in C n — /, and 

\ T U,k)\ > t 

for all (j, k) in C n n I. 

The notation here is a little bit tricky, but the result is very straightforward: 
under some simple conditions, we find all marginal interactions and make no false 
identifications. 

While there were a number of assumptions in the above theorem, most of these 
are fairly trivial and will almost always be found in practice: the variance must be 
bounded away from and the correlations bounded away from ±1. The assumption 
that the correlation differences are bounded below by a fixed A min for true marginal 
interactions is a bit more cumbersome, but may easily be relaxed to A min — > at a 
slow enough rate that A min / [log p/n] 1 ^ 2 — > oo. 

The astute reader might note that our assumption bounding the variance away 
from seems strange — the distribution of the sample correlation is independent 
of the variance. This is necessary only because we assumed the covariates have a 
subgaussian tail with a shared constant C. One could have relaxed the bounded 
variance assumption to the assumption that {xj/(jj}j =1 have a sub-Gaussian tail 
with a shared constant C. 
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6.1 Permutation Consistency 

Now that we have shown our procedure has FDR converging to 0, we would like 
to show that it asymptotically estimates FDR consistently as well. In particular 
we show that as n — > oo, if — > 0, then with probability approaching 1, for a 
random permutation, our permuted statistics converge to uniformly in probability 

(max^fc fc ) < t for any fixed t > with probability converging to 1). Thus our 

estimated FDR converges to under the same conditions as our true FDR. 

We begin with some notation. Let us consider an arbitrary permutation of class 
labels, II. Let rr denote the proportion of observations from class 1 that remain in 
class 1 after permuting. 

We discuss a somewhat simplified procedure in our proof, as otherwise the algebra 
becomes significantly more painful (without any added value in clarity), but it is 
straightforward to carry the proof through to the full procedure. In our original 
procedure, after permuting class labels we recenter and rescale our variables within 
each class. Because we already centered and scaled variables before permuting, this 
step will have very little effect on our procedure (though it does have the nice effect 
of never giving \p*\ > 1). In this proof we consider a procedure identical in every way 
except without recentering and rescaling within each permutation. 

Before we give the theorem, we would like to define a few new terms for clarity. 
For a given permutation II, let IL(m) G {0,1} be the permuted class of the 2-th 
observation originally in class m. Furthermore, let IT (to, I) be the set of observations 
in class to that are permuted to class I, and let II (•, /) be the set of observations in 
both classes permuted to class /, ie. 

U(m,l) = {i : Ui(m) = 1} 
U(-,l) = {(i,m) : IL(m) = /} 

Now, we give a result which shows that for any fixed t > if our variables are 
sub-Gaussian with some other minor conditions, then for n — > oo and log p/n — > 
with probability approaching 1, none of our permuted statistics will be larger than t, 
or in other words, as our true converged to 0, so will our estimated FDR 0. As before, 
the proof of this theorem is based on several technical lemmas which we again leave 
to appendix C. 

Theorem 6.2 Let and x 2 (j), j = 1, . . . be random variables with 

P(\x m{j) -E[x m(j) ] | > t) <l-e t2 /° 
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for all t > 0, and each m = 1,2, with some fixed C > 0. Let jJ. m {j) denote the mean 
of x m (j) an d a m(j) ^ S variance. For each i < oo, let x m (j i .) be independent realizations 
with the same distribution as £ m (.). 

Let p n be a sequence of integers such that lo ^ Pn — > 0. Let R m be the correlation 
"matrix" (an infinite but countably indexed matrix) of the covariates from class m. 

Assume for every m, j, o~ 2 m ^ > (y\ lin (for some cr^ in > 0). Furthermore, assume 

that for m = 1,2, sup i<fc \R m (j,k) \ < Pmax for some fixed p max < 1. 

Now, given any e p > 0, and < t, if we choose n sufficiently large and let U be a 
random permutation, then with probability at least 1 — e p 

\ T m\<t 

for all (j, k) with j, k <p n where 

T (j,k) = arctanh (-R perm: i(j,k)) - arctanh (-R per m:2(j,k) 



and 



n permMm 2^ I £ ) V 

(i,z)en(-,m) v ' - v 



The notation is again somewhat ugly, but the result is very straightforward: under 
some simple conditions, our permuted statistics are very small. In particular from 

the proof one can see that sup ^ j = O p (^y^log Pn/nj . 

Note there is an implicit indexing of n in Rperm:m(j,k) (h seemed unneccessary to 



add more indices). As in theorem 6.1 some of our conditions may be relaxed. Instead 
of bounding cr| below, we need only bound Caj below. Also, rather than choose a 

1 /2 

fixed cutoff, t > 0, we may use any sequence {t n } with t n / (log p n /n) 1 — > oo. Also, 
as noted before, the result we have just shown ignores the restandardizing within 
each permutation, however it is straightforward (though algebraicly arduous, and not 
insightful) to extend this result to that well. 



As a last note, in theorem 6.2, we gave our consistency result for only a single 
permutation. This result can easily be extended to any fixed number of permutations 
using a union bound. This was left out of the original statement/proof as the notation 
is already clunky and the extension is straightforward. 



Through theorems 6.1 and 6.2 we have shown that, under fairly relaxed conditions, 
our procedure is asymptotically consistent at discovering marginal interactions and 
that the permutation null reflects this. 
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7 Discussion 



In this paper we have discussed marginal interactions for logistic regression in the 
framework of forward and backward models. We have developed a permutation based 
method, TMIcor, which leverages the backward model. We have shown its efficacy 
on real and simulated data and given asymptotic results showing its consistency and 
convergence rate. We also plan to release a publically available R implementation. 

8 Appendix A 

In this section we give more details on our permutation-based estimate of FDR, 
and discuss a potential alternative. Recall that we are using the permutations to 
approximate 

E p (i%*)i>*)- ( 6 ) 

(j,k) null 

For the moment, assume that all covariates in both classes have mean and variance 
1, and that we did not do any sample standarization. Then, under the null hypothesis 
that Ri(j t k) — R2(j,k)i T(j t k) calculated under the original class assignments and 
calculated under any permuted class assignments have the same distribution, so 

E P (\T( j ,k)\>t)= e P (\T* m \>t) 

(j,k) null Oifc) nu H 

which is reasonably (and unbiasedly) approximated by 

E iEmi&)i>*). 

(j,k) null a=l 

Because we do not know which genes are null, our actual estimate of ^ is 

EiE J (i T ^)i>*)= E jJ2mr,k } \>t) ( 7 ) 

(j,k) a=l null a=l 

+ E jt,mr,k)\>t) (8) 

(j,k) alternative a=l 

This gives a slight conservative bias (especially small if most marginal interactions 
are null). One should also note that unlike the null statistics, for the alternative 
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(J, k), TT- k -, are not distributed N (0, ^zg); they are still mean 0, but the variance is 
increased. However, this conservative bias is very slight — in general there are few 
alternative hypotheses, and the variance increase is not large. 

Because in practice we do not have mean 0, variance 1 for all covariates in both 
classes, we must standardize before running our procedure. Otherwise, instead of 
testing for a changing correlation, we are actually testing for a different mean, vari- 
ance, or correlation between classes. The effect of standardizing with the sample mean 
and variance rather than the true values is asymptotically washed out, and while the 
variance of our tests is increased for small samples, this increase is only minimal. 



An alternative to permutations, as discussed in Efron 2010) , is to directly estimate 
the numerator using the approximate theoretical distribution of the null statistics. 

Each null statistic is asymptotically iV ^0, + ^z^j, so f° r (j, k) null 



P{\T m \>t) = 2<S> 



t(n 1 - 3)(n 2 - 3) 



n 1 +n 2 - Q 

Now we can conservatively approximate the quantity in Eq ^ by 
P (l%fc)l > t) < vip - l)/2 ■ P (|T null | > t) 

(j,k) null 

t(n 1 - 3)(n 2 - 3) 



p{p - 1) • $ 



n\ + n 2 — 6 



While this approach is reasonable and simple, it is less robust than using permuta- 
tions, and in practice, even for truly Gaussian data, it is only slightly more efficient. 



9 Appendix B 

Before proceeding, we remind the reader that x are our variables of interest and 
z are potential confounding variables. Furthermore we are interested in comparing 
cor ([xjl-z] , [a?jfe|;z]) between groups. From basic properties of the Gaussian distribution 
we know that 

x\z ~ N [fx x + E^EJ 1 - fi z ) , £(x[z)] 

where ^t x \z) i s the variance/covariance matrix of x given z, ^( x ,z) is the covariance 
matrix between x and z, E^ is the variance matrix of z, and \i x and \i z are the means 
of x and z. Now, if /x 2 , E( X)Z ), and E z were known, then the MLE for ^t x \ z ) would 
be 

S(,| Z) = -[x-i f i T x -(z- ifi) s; 1 E ( ^ ) ] T [x - ip T x -{z- inl) z z % z , x) ] . 
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Unfortunately, these nuisance parameters are unknown. However we can also estimate 
them by maximum likelihood. This gives us the estimate 



*{X\Z) = 



n 
1 

n 



X - Z '/. 7.\ 1 Z T X 



x - z ' z T z) 1 Z T X 



where Z is the standardized version of Z, and X is the standardized version of X, 
and P^ ± is the projection onto the orthogonal complement of the column space of 
Z. So, our estimate of partial correlation is just an estimate of correlation with Z 
regressed out of both covariates. We use this to contruct our permutation null. In 
the orginal algorithm, we mean centered and scaled before permuting; here we do the 
equivalent — we project our variables of interest onto the orthogonal complement of 
our nuisance variables, and then center/scale them. Now we are ready to permute. 
We permute these "residuals", and calculate permuted correlations as before. 

Before proceeding, we note that for n sufficiently large n (n » p) one might use 
a similar approach to consider partial correlations rather than marginal correlations 
in our original algorithm (conditioning out all covariates except any particular 2). 
However, in general n « p and thus P ± = rendering this approach ineffective 
- this approach only works for nuisance variables because we assume that there are 
very few relative to the number of observations. 

As stated in the text, to adapt the original algorithm to deal with nuisance vari- 
ables we need only replace step (1) by: 

1. Replace our feature matrices X\ and X 2 by 

X m — \l — Z m [Z jn Z rr ^ Z m ] X m 

Now, mean center and scale X within each group. 

One may note that we only calculate X once per class, at the beginning of our 
procedure, not in each permutation. We do this for a similar reason that we stan- 
dardize our variables before permuting — because we are not testing the hypothesis 
that the relationship between X and Z is the same in both groups. If we relcalulate 
after each permutation then we are implicitly assuming that this relationship is the 
same in both groups under the null. 

Even with nuisance variables this approach is very computationally fast. Pro- 
jecting our original variables onto Z _L can be done in O (npp nu i S ) operations where 
p nu i s is the number of nuisance variables. Thus the total runtime of this algorithm is 
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O (npp nu i S + Anp(p — l)/2) where A is the number of permutations — this is domi- 
nated by the second term, which is independent of the number of nuisance parameters. 
In contrast, if we were to use the standard approach (fitting pairwise logistic regres- 
sions with nuisance variables), its runtime would be O [(iter) (3 + Pmiis) 2, np(p ~ l)/2] 
where iter is the number of iterations of the algorithm for finding the MLE. In general 
A ~ 100 and iter ~ 5. Now, since (3 +p nu i S ) 2 grows very quickly in p nu i S , for even a 
small number of nuisance parameters the logistic approach becomes much slower. 



10 Appendix C 



This appendix contains the technical details from the theorems in section 7 of the 
main manuscript. We begin with a number of technical lemmas: 

First, as one might imagine, if we can consistently estimate our correlation ma- 
trices, applying a Fisher transformation should not change much. We formalize this 
with the next lemma. 

Lemma 10.1 Let Ri, R2 be correlation matrices, and Ri, R 2 be estimates of R\ and 
R 2 . 

Let I be the set of ordered pairs (j, k) where Ri(j,k) 7^ R2(j,k)- Assume for all (j, k) 
in I, \Ri(j,k) ~ R2(j.k)\ > Amin for some A min > and that for m = 1,2 we have 
sup i<fc ||-Rm(j.fc)IL < AW for some fixed p max < 1. 



Further assume that for m = 1,2, 



Rm, — R 11 



Then for all (j, k) in I c with j ^ k we have 

arctanh ^i?i(j,&)J — arctanh (jli&k) 
and for all (j, k) in I with j ^ k we have 

arctanh [Ri(j,k)) ~ arctanh [R,2(j,k) 



< 5 (for some 5 < 1 - p ma J 
25 



< 



1 - (Pmax + 5Y 



> A r 



25 



(9) 



(10) 



One immediate consequence of this lemma is that as 5 — > 0, for (j, k) in I c our 
statistics T(j,k) converge to (at rate 0(5)), and for (j,k) in J, Tj^fc) are bounded 
away from (at a rate of at least 0(5)). 



Proof 10.2 (Proof of Lemma 10.1) We begin by showing that for all (j,k) in 1° 
with j 7^ k we have 

25 



arctanh (jii(j,k)j — arctanh (^(j^)) 



< 



1 - (Pmax + 5Y 
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The mean value theorem gives us that 



arctanh ( R\{j,k) ) — arctanh ( R2(j.k) 



< 



sup r 



1 -r 2 



R\(j,k) - R 2(j,k) 



where the supremum is taken over r in 
have \Rm{j,k)\ < p max + 5, and Ri(j,k) — R2(j,k) 

1 



Ri{j,k), R2(j,k) 



Note that for m = 1,2, we 



sup r 



Rl(j,k) ~ R2(j,k) 



1 - (Pmax + 5) 



1 — r 2 

Now for (j, k) in I, we again use the mean value theorem: 

1 



< 25, for (j, k) not in I . Thus, 
25 

< 



2 ' 



arctanh (^Ri(j,k)^ — arctanh (^R.2(j,k)J 



> inf r 



1-r 2 



Riti,k) - R2(j,k) 



and our result follows because 



Ri(j,k) ~ R2(j 



AO 



^* IT! 1 T 



25. 



Now we consider convergence of these sample correlation matrices. We show that 
their convergence depends only on the convergence of the sample means (fij), variances 
(<r|), and pairwise inner products. We formalize this in the following lemma. 

Lemma 10.3 Let Xj, j = 1,. . . be random variables. Let fij denote the mean of Xj 
and cr| its variance. Let Rj^ be the correlation between Xj and x/,. For each i, let Xi t . 
be independent realizations with the same distribution as x. ( eg. Xij has the marginal 
distribution of Xj). 

For any given e > ; there exists 5 > such that if 



SUp < \<Tj -<Tj\, \Hj ~fij\, 



R 



then 



Sn Pj<k<p 

Furthermore, one can choose 5 = 0(e) 



o-j(T k OjO k 

Rj,k Rj,k 



< 5 



3>k 



< e 



fir 



fl2l 



Proof 10.4 (Proof of Lemma 10.3) We begin by noting that the distribution of 
Rj^k is independent of fij, fi k , Oj and o~ k . For ease of notation we assume [ij = \L\. = 
and Oj = CTfe = 1 . 
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To see that (11) is sufficient for (|12j) we write Rj ^ ^ 



R 



R 



-R 



o-jO- k 



J fe 



< 



+ 



1 n 

— ^ ^ •Ei,j%i,k Rj,k 



i=l 



+ 









We /irsi note t/iat | ^ 2~^iLi %i,jXi : k 



Rj,k\ < 5. Thus we need only consider 



HJ^k 



and 



Expanding these terms using the fact that 1/(1 — 5) = 1 + 0(5), it is 

straightforward to see that the whole expression converges to at rate 0(5). This 
completes our proof. 

Now that we have reduced convergence to that of the sample mean, variance, 
and inner products, we show particular circumstances under which our estimation is 
consistent, and give rates of convergence. 



Lemma 10.5 Let Xj, j 

such that for allt >0 



1, . . . be random variables. Assume there is some C > 



P (\ Xj - E[ Xj ]\ >t)< exp (1 - t 2 /C 2 ) 

(These are known as sub-Gaussian random variables). Let fij denote the mean of xj 
and er? its variance. Let Rj^ be the correlation between Xj and Xk- For each i, let Xi v 
be independent realizations with the same distribution as x. 

Let 5, e p > be given. Then for n sufficiently large and sufficiently small we 
have that 



sup 



a 3\ > 



Hj fj,j\ , 



R 



O-jCTk 



o-jCr k 



< 5 



(13) 



1 /2 

with probability greater than 1 — e p . In particular one can choose 5 = (logp/n) 

The class of subgaussian random variables is rather broad, containing gaussian ran- 
dom variables and all bounded random variables. Applying this lemma, we are able 
to show consistency for the wide class of variables with sufficiently light tails. 

1/2 

In the proof of this lemma we get a convergence rate of 5 = O (log p/n) 1 . This 
rate agrees with the literature for other similar problems in covariance estimation 
(Bickel and Levina 2008| among others). 
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Proof 10.6 (Proof of Lemma |10.5[ ) We will begin by bounding — 
consider Lemma 5.10 of Vershynin '20101 we see that 



If we 



Ct 2 )n 



P (\fij — > t) < e ■ exp 



where C is some function of C ( one can prove this Hoeffding type inequality by an 
exponential Markov argument). Applying the union bound to this we see that 



P (sup^p \flj — > t) < 3pexp 



Ct 2 )n 



If we set t = (^Jl/C) \/ a± ^ m then we have 

P (sup^p \p,j — fij\ > t) < e 1 

bounding \ fij — fij\- 

Next we bound \&j — 0"j|. We first note that 



-9 



\ a 3 ~ °i\ 



\± 2 2 "2 2 

h - a j\ < h - a j\ 



0~j + Q-j 



a 



3 



because o^Oj > 0. so we need only consider convergence of a 2 — a 2 . Next note that 



i i 

So now if we can bound |^ £\ (xy — fij) 2 — o 2 \ and (xj — fij) 2 , then we can bound 



ni -oil 



To bound | ^ £V (xy — /i,-) 2 — cr|| ; we /irsi note i/iai if xy sub-Gaussian th 
(xy — /ij) 2 subexponential; ie 

P ((xy - /ij) 2 - <Ti > t) < exp (-Cit) 



/or some /ixec? C\. Now we apply Corollary 5.17 of Vershynin \201 1, and get that f 
any t sufficiently small (independent of n) 



or 



•Cit 1 
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for some fixed C\. Bounding (xj — fij) 2 is also quite straightforward (we just use the 
bound for \xj — to\) 



P {{xj — fij) 2 >t)< eexp — (ptj n 



We note that for t < 1, t 2 < t. Let C = min{Ci, C}. Now, combining these inequali- 
ties with the triangle inequality we have 



- Ct)n 



+ 2 exp 



(-(V) 



P{\a 2 -a 2 \ >t)< eexp 

< 5 exp [-Ct 2 n] 
for t sufficiently small. Now finally, 

P _ a .\ > t )<P (\a 2 - a 2 \ > ta miQ ) < 5 exp [-Ca 2 min t 2 n] . 
Using the union bound again, we get 

P (supj \a 2 - a 2 \ > t) < 5pexp [-Ct 2 n] . 



so 



Finally, we need to bound 



P (sup,,- |<7j — Gj\ >t) < 5pexp [— Ccr^ in t 2 n] . 

( 1 / n )Y, i <„ x i,j x i,k njnk 



Pj,k 



This is slightly trickier 



but still not terrible. We first note that 

O-M^Xijx^k - fijfik = {I /n) ( x i,j - to) (xi,k - to) 



i<n 



i<n 



We also see that 

2 Yl ( Xi 'i ~~ Vj) (xi,k ~to) = Yl ~ Pi) + ( X i,k ~ to)f 



i<n 



i<n 



i<n 



i<n 



Now to bound the above quantity we consider the moment generating function of 
x i,j ~ to + x i,k ~ to- This not necessarily the sum of independent random variables, 
still by Cauchy Schwartz we have 

E [exp [t (xij - fXj + x itk - /i k )W 

< max {E [exp [2t (x id - //,)]] , E [exp [2t (x ijk - fi k )]]} 
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It is a well known fact that sub-gaussan random variables can be charaterized by their 
MGF (shown in Vershynin ^2010$ ), and this is still the moment generating function 
of a subgaussian random variable. Thus, (xij — fa + — is sub- exponential, 
and again by Corollary 5.17 of Vershynin \201 1 we have that 

p ^ \ ( Xi > j ~ W + x i,k - ^kf - o] -<y\- 2<jj<?kPj,k > tj 
< 2 exp [-C 2 t 2 n] . 

fort > sufficiently small and some fixed C 2 > 0. Now, stringing all of these together 
with the triangle inequality we have that 



> 3t 



< P 



i<n 

( ~ ( Xi >3 ~ ^ + Xi ' k ~ ^ ~ °i +(J k~ 2a 3 a kPj,k > t J 

+p ( ll2( Xi >j-vj) 2 -^ >*) + p ( lJ2^ k -^f-4 >A 

\ i<n / \ i<n / 

< 2exp [-C 2 t 2 n] + 2 * 5exp [-Ct 2 n] 

< 12exp [-dt 2 n\ 

for allt > sufficiently small with some fixed C\ > 0. Taking this a step further, and 
applying the union bound, we see that 



P \ sup jjk 
for some fixed C 2 . 



OjO k 



0-j(Tk 



Pj,k 



> t < 12p 2 exp [-C 2 t 2 n] 



Now that we have bounded each term, we see that (13) happens with probability at 
most 

Yip 2 exp [~C 2 5 2 n] + 2* 5pexp [-Ca^SPn] + 2* 3pexp -C5 2 n 
< 28p 2 exp [-C5 2 n] 

for 5 sufficiently small where C = min {cV^m; C 2 , c\. Thus, if 5 = ( <?+ ^, 1 ° gp ) 1 ^ 2 



then we have (J 1 3|) with probability at least 1 — 28e q . If n is sufficiently large, and 

n 



sufficiently small, then for any q, 5 can be made arbitrarily small. 



27 



Now, we combine these lemmas to show that under certain conditions, for a given 
cutoff t, as n — > oo if \ogp/n — > then, with probability approaching 1, all true 
marginal interactions have |Tjj| > t, and all null statistics will have |Tjj| < t (ie. we 
asymptotically find all true interactions and make no false rejections). 

Before we begin, it deserves mention that we use slightly different notation than 
in the discussion of our algorithm in Section 3. Rather than having X^. denote the 
i-th observation overall, and letting y(i) denote its group (where i ranged from f to 
the total number of observations in both groups), we split up our observations by 
group, letting x m ^^ denote the i-th observation from group m (now i ranges from 
f to the total number of observations in group m). This change simplifies notation 
in the statement of the theorem and its proof. We also assume equal group sizes 
{rii = n 2 = n), this again simplifies notation but can be relaxed to n\/(n\ + n 2 ) — > 
a E (0, f). 



Proof 10.7 (Proof of Theorem 6.1) This result is a straightforward corollary of 
our 3 lemmas: 

First choose an arbitrary e p > 0, and < t < A min . // we consider Lemma 10.3, 



we see that the conclusion of our theorem holds if we can find a bound on the sup- 
norm distance between each correlation matrix and its MLE (a bound I will call 8\) 
which satisfies 

28 l 



max 



\2 " 



28i \ < t. 



Because p Tl 



1 - (Pmax + $1 

< 1, Si > sufficiently small will satisfy this. 



Now applying Lemma 10.5: if we choose 8 2 sufficiently small (but still ofO(Si)), 
then if 



SUp < \Oj — <Jj \ , — jJ>j\ 



0~jO- k O j O k 



pj,k 



< 8, 



(14) 



we have that the sup norm distance between each correlation matrix and its MLE is 
bounded by 5\: for m = 1, 2 



R"m R"m 



<8 1 



Finally, by Lemma. ' 10.1, we see that if n is sufficiently large and logp/n is suffi- 
ciently small then (14) holds with probability at least 1 — e p . This finishes our proof. 
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10.1 Proofs of Permutation Results 

To begin, we prove a Lemma which does most of the leg-work for our eventual theorem. 
It says that for a reasonably balanced permutation, for n sufficiently large and logp/n 
sufficiently small, both of our permuted sample correlation matrices will be very close 
to the average of the 2 population correlation matrices. 

Lemma 10.8 Let x^ and x 2 (j), j — 1,. . . be random variables with 

P(\x m{j) -E[x m{j) ] | > t) <l-e t2/c 

for all t > 0, and each m = 1,2, with some fixed C > 0. Let H m (j) denote the mean 
of x m (j) and cr^^ its variance. For each i < oo, let x m ^^ be independent realizations 
with the same distribution as x m ^.y 

Let p n be a sequence of integers such that lo ^ Pn — > 0. Let R m be the correlation 
"matrix" (an infinite but countably indexed matrix) of the covariates from class m. 
Define R pcrm to be the average of the two, 



\ + 1. 

2 2 



Rperm — ~Rl + -R2 



Let fi m (j) and be the pre-permuted estimates of the mean and variance (in each 

class): 



f'Mj) X m(i,j) 
i<n 



and 



Further, define 



i<n 



our permuted correlation between covariates j and k in class m. 

Assume for every j , a? > <r^ n > 0. Now for any e > 0, 5 > 0, one can find n 



sufficiently large such that for any permutation, U with 

1 



7T 

2 



5 

< — 
~ 12 
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(where rr is the proportion of class 1 that remains fixed under U). We have 



Rperm Rperm.m 



< 5 



(15) 



for both m — 1, 2 with probability at least 1 



Proof 10.9 (Proof of Lemma 10.8) We first consider only m = 1. If we can show 
that 



Rperm Rperm.m 



< s 



(16) 



with high probability for m — 1, then by symmetry we have it for m = 2, and by a 
simple union bound we have it for both simultaneously. 



Now, we begin by decomposing our sample permuted correlation matrix 



R 



perm: 1 



7TR 



E 

(t,m)en(-,i) 
(i) ,/ 

perm:\ 1 V 



a 



m(J) 



a 



m(k) 



7T) R 



(2) 

perm: 1 



where Rpl rm i is a matrix defined by 



R 



(0 

perm:l(j,k) 



1 E 



ien(z,i) 



x l(i,k) ~ Al(fc)\ 
o-i(fc) / 



(17) 



where hi is the number of elements from group I permuted to group 1 (ie. the cardinal- 



ity of 11(1,1) or more explicitly hi = Tin and hi = (1 — rm). The quantity (17) is just 
the contribution from observations originally in class I to the permuted correlation 
matrix for class 1. Thus by the triangle inequality 



R 



perm 



R 



perm: 1 



< 



< 



+ 



1 



- ttR 



(1) 

perm: 1 



1X\ ^perm: 1 



+ 

oo 

1 

+ 2 

oo Z 



-R 2 — (1 — 7r) -Rp2~ m: i 



;is) 



Ro — R 



(2) 

perm:l 



7T — 



^perm: 1 



(2) 

perm: 1 



//we consider i2£2m-iJ we see that it is essentially a sample correlation matrix (using 
only the fm observations that were fixed in class 1 by H for the inner product). We 
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can make a similar observation for R 



(2) 

perm: 1 ■ 



Now, for n sufficiently large, because 



7r| is small, we can make fm and (1 — fc) n as large as we would like. Thus, by a 



combination of Lemma 10.1 and Lemma 10.5, we have that 



Ri — R 



(0 

perm: 1 



< 5/3 



with probability greater than 1 — e/3. Furthermore, using the same Lemmas we get 



R 



(i) 

perm:! 



+ 



R 



(2) 

perm: 1 



< 4 



with probability at least 1 — e/3 (this bound can easily be made tighter, and if we were 
to standardize within permutation this bound is trivial). Plugging this in with the 
assumed bound on \n — \ \ completes the proof. 



Now, we use this Lemma (along with some of our previous Lemmas) to show 
that for any fixed t > if our variables are subgaussian with some other minor 
conditions, then for n — > oo and logp/n — > with probability approaching 1, none of 
our permuted statistics will be larger than t, or in other words our estimated FDR 
will converge to 0. 

Proof 10.10 (Proof of Theorem 6.2) First we choose an arbitrary 2e p > and 
t > 0. If we consider Lemma 7.1, we see that if we find some 5 > satisfying 



Rperm Rperm:m 



for m = 1, 2 with probability at least 1 — e p and 
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1 - (Pmax + <5) 2 



< t 



(19) 



(20) 



then we have satisfied our claim. Because, p max < 1, there exists some 5 > satis- 



fying (20). Now, we first note that, for n sufficiently large, standard concentration 



inequalities give us that 



71 



< 8/12 



with probability greater than 1 — t v . If we apply Lemma 7.5 with this bound on ft and 
combine the probabilities with the union bound, we get that for n sufficiently large 



(19) is violated with at most probability 2e p . This completes our proof. 
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